[GLUTEN-10113][VL] Refactor hive connector register to accept session level config#10585
[GLUTEN-10113][VL] Refactor hive connector register to accept session level config#10585zhouyuan wants to merge 1 commit intoapache:mainfrom
Conversation
0f51d4b to
62eb36e
Compare
cpp/velox/utils/ConfigExtractor.cc
Outdated
| hiveConfMap[facebook::velox::connector::hive::HiveConfig::kOrcUseColumnNames] = "true"; | ||
|
|
||
| return std::make_shared<facebook::velox::config::ConfigBase>(std::move(hiveConfMap)); | ||
| return std::make_shared<facebook::velox::config::ConfigBase>(std::move(hiveConfMap), false/*mutable*/); |
There was a problem hiding this comment.
The default value of parameter 'mutable' is false, no need to pass it.
|
|
||
| // register the hive connectors | ||
| std::call_once( | ||
| gluten::VeloxBackend::get()->regFlag, [&]() { gluten::VeloxBackend::get()->initConnector(veloxCfg_); }); |
There was a problem hiding this comment.
Wonder if we could register with a unique connector ID for each WholeStageResultIterator and call velox::connector::unregisterConnector from the dtor?
The connector ID can use similar naming as the queryID
fmt::format(
"Gluten_Stage_{}_TID_{}_VTID_{}",
std::to_string(taskInfo_.stageId),
std::to_string(taskInfo_.taskId),
std::to_string(taskInfo_.vId))
There was a problem hiding this comment.
it looks like in Spark case, sometimes two stages will be scheduled to run at the same time.
| void VeloxBackend::initConnector(const std::shared_ptr<velox::config::ConfigBase>& hiveConf) { | ||
| for (const auto& [key, value] : hiveConf->rawConfigs()) { | ||
| // always update to use new session level conf | ||
| backendConf_->set(key, value); |
There was a problem hiding this comment.
If the initConnector is only called from WholeStageResultIterator, is it possible to remove the filesystem related configurations from the backend conf?
There was a problem hiding this comment.
Users may pass/modify the filesystem related configurations at session level, by setting the keys here it will allow us to pick the latest values
3d4b7db to
ff0de62
Compare
| @@ -294,15 +293,24 @@ void VeloxBackend::initCache() { | |||
| } | |||
|
|
|||
| void VeloxBackend::initConnector(const std::shared_ptr<velox::config::ConfigBase>& hiveConf) { | |||
There was a problem hiding this comment.
Should we move the function to WholeStageResultIterator?
| task_->setSpillDirectory(spillDir); | ||
|
|
||
| // register the hive connectors | ||
| std::call_once( |
There was a problem hiding this comment.
In what case, we may register twice?
| const std::unordered_map<std::string, std::string>& conf) { | ||
| backendConf_ = | ||
| std::make_shared<facebook::velox::config::ConfigBase>(std::unordered_map<std::string, std::string>(conf)); | ||
| backendConf_ = std::make_shared<facebook::velox::config::ConfigBase>( |
There was a problem hiding this comment.
mutable true make copy the configs when iterating the configs
There was a problem hiding this comment.
Could we generate the HiveConfig by veloxCfg_ in WholeStageResultIterator every time?
There was a problem hiding this comment.
No, here veloxCfg_ contains only session level configurations, it does not cover the configurations defined in spark.conf
8faee76 to
f722074
Compare
25dfdda to
0d6201b
Compare
Signed-off-by: Yuan <yuanzhou@apache.org> remove duplicated call Signed-off-by: Yuan <yuanzhou@apache.org> revert to use inmutable config map Signed-off-by: Yuan <yuanzhou@apache.org> fix to use dynamic config Signed-off-by: Yuan <yuanzhou@apache.org> fix return config copy Signed-off-by: Yuan <yuanzhou@apache.org> unregister hive connector before register Signed-off-by: Yuan <yuanzhou@apache.org> register io connector before visit filesystem Signed-off-by: Yuan <yuanzhou@apache.org> refactor Signed-off-by: Yuan <yuanzhou@apache.org> Revert "refactor" This reverts commit 007472e. Revert "Revert "refactor"" This reverts commit 4d2acca. omit register if same session config detected Signed-off-by: Yuan <yuanzhou@apache.org>
0d6201b to
701e64c
Compare
|
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days. |
|
This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks. |
What changes are proposed in this pull request?
This patch adds a refactor on hive connector registration. instead of registering when instancing Velox, we defer it to when creating the wholestage context. This allows us to get the session level configurations, and pass to hive/velox
How was this patch tested?
local verified and pass existing GHA